Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery

Authors

Rizwan A. Ashraf

Saurabh Hukerikar

Christian Engelmann

Abstract

Efficient utilization of today’s high-performance computing (HPC) systems, with their complex hardware and software components, requires that HPC applications be designed to tolerate process failures at runtime. Given the low mean time to failure (MTTF) of current and future HPC systems, long-running simulations on these systems need applications that can themselves handle process failures gracefully. In this paper, we explore the use of user-level failure mitigation (ULFM), a set of fault tolerance extensions to the Message Passing Interface (MPI), for handling process failures without discarding the progress made by the application. We explore two alternative recovery strategies, both of which use ULFM along with application-driven in-memory checkpointing. In the first, the application is recovered with only the surviving processes; in the second, spares replace the failed processes so that the original configuration of the application is restored. Our experimental results demonstrate that graceful degradation is a viable alternative for recovery in environments where spares may not be available.
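The difference between the two recovery strategies can be illustrated with a small model. The sketch below is plain Python and does not call MPI or ULFM; the function names (`shrink`, `substitute`, `partition`) and the block-distribution scheme are assumptions made for illustration, not the authors' implementation. It shows only how the set of ranks and the per-rank share of work might change after a failure.

```python
from typing import List, Set


def shrink(ranks: List[int], failed: Set[int]) -> List[int]:
    """Hypothetical 'shrink' recovery: keep only the survivors.

    The new rank of a process is its index in the returned list.
    """
    return [p for p in ranks if p not in failed]


def substitute(ranks: List[int], failed: Set[int], spares: List[int]) -> List[int]:
    """Hypothetical 'substitute' recovery: a spare takes over each failed slot,
    so the number of ranks (the original configuration) is preserved."""
    pool = list(spares)
    n_failed = sum(1 for p in ranks if p in failed)
    if n_failed > len(pool):
        raise ValueError("not enough spares to restore the original configuration")
    return [pool.pop(0) if p in failed else p for p in ranks]


def partition(n_items: int, n_ranks: int) -> List[range]:
    """Block-distribute n_items over n_ranks (assumed data layout)."""
    base, extra = divmod(n_items, n_ranks)
    blocks, start = [], 0
    for r in range(n_ranks):
        size = base + (1 if r < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    ranks, failed, spares = [0, 1, 2, 3], {2}, [4]
    after_shrink = shrink(ranks, failed)
    after_substitute = substitute(ranks, failed, spares)
    # After shrinking, the same 12 items are spread over fewer ranks.
    print(after_shrink, [len(b) for b in partition(12, len(after_shrink))])
    # After substitution, the original per-rank workload is unchanged.
    print(after_substitute, [len(b) for b in partition(12, len(after_substitute))])
```

In this model, shrinking leaves each survivor with a larger share of the work (graceful degradation), whereas substitution keeps the original distribution but needs a spare to be available. In both cases the data owned by the failed process would have to be restored from a checkpoint, which the sketch does not model.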

Similar resources

Towards Adaptive Resilience in High Performance Computing

With the current growth in computing capabilities of high performance computing (HPC) systems, Exascale HPC systems are expected to arrive by 2020 [1]. As systems become larger and more complex, they also become more error prone [2]. The failure rate of HPC systems increases rapidly, such that failures become the norm rather than the exception. Therefore, in such an unreliable environment, to mai...

Dealing with Failures During Failure Recovery of Distributed Systems ; CU-CS-1009-06

One of the characteristics of autonomic systems is self recovery from failures. Self recovery can be achieved through sensing failures, planning for recovery and executing the recovery plan to bring the system back to a normal state. For various reasons, however, additional failures are possible during the process of recovering from the initial failure. Handling such secondary failures is impor...

Concurrent Checkpointing and Recovery in Distributed Systems

The main objective of this paper is to speed up the consistent state restoration of distributed systems. Process recovery uses vector time to address unusual message handling issues and overlapping failures. Single rollback of non-failed process in response to a single failure has low message complexity. After a failure, processes required to rollback do so concurrently, which substantially dec...

Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to ca...

A New and Efficient Algorithm-Based Fault Tolerance Scheme for A Million Way Parallelism

Fault tolerance overhead of high performance computing (HPC) applications is becoming critical to the efficient utilization of HPC systems at large scale. HPC applications typically tolerate fail-stop failures by checkpointing. Another promising method is in the algorithm level, called algorithmic recovery. These two methods can achieve high efficiency when the system scale is not very large, b...

Journal:

CoRR

Volume abs/1801.04523

Publication date: 2018
